Summary

This workflow demonstrates the mitosRNAseq pipeline which identifies several types of small RNA from sequencing data from short reads of <50 nt. The following method collects sequence-based counts, sequence locations, genomic features, and annotations. The data we use is from killifish embyros anoxia experiments.

The step-by-step workflow we follow is illistrated below and a detailed method to perform small RNA seq on an academic HPC is decsribed in this document.

The following table shows the contents of the file samples.csv that contains metadata for our samples. We start with a total of 40 fastq files.

Sample Table

# Sample Condition Stage Input files
1 31_4d_t0 4d_t0 4 31_4d_t0.fastq.gz
2 32_4d_t0 4d_t0 4 32_4d_t0.fastq.gz
3 33_4d_t0 4d_t0 4 33_4d_t0.fastq.gz
4 34_4d_t0 4d_t0 4 34_4d_t0.fastq.gz
5 35_4d_4hrA 4d_4hrA 4 35_4d_4hrA.fastq.gz
6 36_4d_4hrA 4d_4hrA 4 36_4d_4hrA.fastq.gz
7 37_4d_4hrA 4d_4hrA 4 37_4d_4hrA.fastq.gz
8 38_4d_4hrA 4d_4hrA 4 38_4d_4hrA.fastq.gz
9 39_4d_24hrA 4d_24hrA 4 39_4d_24hrA.fastq.gz
10 40_4d_24hrA 4d_24hrA 4 40_4d_24hrA.fastq.gz
11 41_4d_24hrA 4d_24hrA 4 41_4d_24hrA.fastq.gz
12 42_4d_24hrA 4d_24hrA 4 42_4d_24hrA.fastq.gz
13 43_4d_2hrR 4d_2hrR 4 43_4d_2hrR.fastq.gz
14 44_4d_2hrR 4d_2hrR 4 44_4d_2hrR.fastq.gz
15 45_4d_2hrR 4d_2hrR 4 45_4d_2hrR.fastq.gz
16 46_4d_2hrR 4d_2hrR 4 46_4d_2hrR.fastq.gz
17 47_4d_24hrR 4d_24hrR 4 47_4d_24hrR.fastq.gz
18 48_4d_24hrR 4d_24hrR 4 48_4d_24hrR.fastq.gz
19 49_4d_24hrR 4d_24hrR 4 49_4d_24hrR.fastq.gz
20 50_4d_24hrR 4d_24hrR 4 50_4d_24hrR.fastq.gz
21 1_12d_t0 12d_t0 12 1_12d_t0.fastq.gz
22 2_12d_t0 12d_t0 12 2_12d_t0.fastq.gz
23 5_12d_t0 12d_t0 12 5_12d_t0.fastq.gz
24 6_12d_t0 12d_t0 12 6_12d_t0.fastq.gz
25 10_12d_4hrA 12d_4hrA 12 10_12d_4hrA.fastq.gz
26 11_12d_4hrA 12d_4hrA 12 11_12d_4hrA.fastq.gz
27 12_12d_4hrA 12d_4hrA 12 12_12d_4hrA.fastq.gz
28 7_12d_4hrA 12d_4hrA 12 7_12d_4hrA.fastq.gz
29 13_12d_24hrA 12d_24hrA 12 13_12d_24hrA.fastq.gz
30 16_12d_24hrA 12d_24hrA 12 16_12d_24hrA.fastq.gz
31 17_12d_24hrA 12d_24hrA 12 17_12d_24hrA.fastq.gz
32 18_12d_24hrA 12d_24hrA 12 18_12d_24hrA.fastq.gz
33 19_12d_2hrR 12d_2hrR 12 19_12d_2hrR.fastq.gz
34 21_12d_2hrR 12d_2hrR 12 21_12d_2hrR.fastq.gz
35 22_12d_2hrR 12d_2hrR 12 22_12d_2hrR.fastq.gz
36 23_12d_2hrR 12d_2hrR 12 23_12d_2hrR.fastq.gz
37 25_12d_24hrR 12d_24hrR 12 25_12d_24hrR.fastq.gz
38 28_12d_24hrR 12d_24hrR 12 28_12d_24hrR.fastq.gz
39 29_12d_24hrR 12d_24hrR 12 29_12d_24hrR.fastq.gz
40 30_12d_24hrR 12d_24hrR 12 30_12d_24hrR.fastq.gz

QC for raw reads

Quality Control (QC) is done by running fastQC on raw reads to evaluate how good (or bad) our data looks.

Install software for Quality Check

conda

Set up a virtual environment in conda that has all of the Quality Control (QC) tools. We are using Miniconda. For creating sharable environments, create .yaml files for each environment you’re making.

conda create -n condafastqc fastqc
conda activate condafastqc
conda install -n multiqc

tmux

For almost every long-ish process, we use tmux to run tasks in a detached terminal window that can keep running if we want to log out or if we accidently log out.

tmux new -s fastqc

Executing

Creating a new file called fastqcr.R that has the script to run fastqc through R. We use this script to automate the process for a list of files in a directory. Running commands through various scripts also keeps a record of our commands.

if (!require("fastqcr")) {
        install.packages("fastqcr", repos="http://cran.us.r-project.org")
        library(fastqcr)
} else {library(fastqcr)}
fastqc(fq.dir="/path/to/directory/init_fastq_files",
      qc.dir="/path/to/directory/fastqc/init_fastqc",
      threads=20)

Running the script with the following command:

Rscript fastqcr.R

Press Ctrl+B and then D for detaching this tmux window. You can log out from your remote computer (HPC) and close your terminal window now while this process keeps running on your remote computer. To get back to this window, log in to your remote computer, and use the command tmux a -t fastqc. F or more information, check tmux.

We will get a fastqc HTML file in the output fastqc directory for each fastq sequence in the input init_fastq_files directory.

FastQC Output

General Statistics from multiqc

Sample PercentDups PercentGC BPMedianReadLength MillionsSeqs
10_12d_4hrA 94.1 51 36 12.6
11_12d_4hrA 92.7 51 36 16.2
12_12d_4hrA 87.4 52 36 9.0
13_12d_24hrA 90.5 53 36 8.5
16_12d_24hrA 91.1 50 36 7.9
17_12d_24hrA 92.4 50 36 29.9
18_12d_24hrA 84.6 52 36 11.8
19_12d_2hrR 87.6 52 36 11.6
1_12d_t0 86.4 51 36 8.7
21_12d_2hrR 95.4 54 36 9.8
22_12d_2hrR 80.5 52 36 6.7
23_12d_2hrR 89.3 52 36 7.8
25_12d_24hrR 88.1 53 36 13.5
28_12d_24hrR 91.1 55 36 5.9
29_12d_24hrR 91.4 54 36 8.3
2_12d_t0 91.5 54 36 8.3
30_12d_24hrR 79.9 53 36 12.8
31_4d_t0 79.3 52 50 11.9
32_4d_t0 80.1 53 50 11.6
33_4d_t0 86.3 51 50 13.7
34_4d_t0 82.9 52 50 15.2
35_4d_4hrA 88.4 52 50 11.0
36_4d_4hrA 87.0 51 50 13.8
37_4d_4hrA 84.7 52 50 15.6
38_4d_4hrA 84.4 51 50 15.3
39_4d_24hrA 79.7 53 50 12.4
40_4d_24hrA 74.4 51 50 12.9
41_4d_24hrA 79.3 52 50 9.7
42_4d_24hrA 85.6 53 50 16.7
43_4d_2hrR 88.3 52 50 13.0
44_4d_2hrR 81.5 44 101 11.0
45_4d_2hrR 87.9 44 101 13.8
46_4d_2hrR 88.9 52 50 16.9
47_4d_24hrR 72.8 45 101 14.6
48_4d_24hrR 83.2 53 50 13.4
49_4d_24hrR 87.1 51 50 22.8
50_4d_24hrR 88.9 41 101 14.8
5_12d_t0 81.3 53 36 11.2
6_12d_t0 81.7 53 36 13.6
7_12d_4hrA 81.4 52 36 10.5

Sequence Counts for each sample

Sequence Counts

Sequence Counts

The mean quality scores per base in the read

Per Base Quality

Per Base Quality

The average quality scores per sequence

Per Sequence Quality

Per Sequence Quality

The GC content per sequence

GC conetent

GC conetent

Sequence Duplication

Duplication in each sequence

Duplication in each sequence

Overrepresented sequences

Total overrepresented sequences found in each sample

Total overrepresented sequences found in each sample

Adapter Content

Adapters found at sequence positions

Adapters found at sequence positions

Overall Pass/Fail

Checking if each sample passes or fails in various meteris

Checking if each sample passes or fails in various meteris

Sequence length distribution

All samples had sequences of a single length (50bp , 36bp , 101bp).

Trimming

Trimming is done with Trimmomatic in our case, however, other tools like cutadapt or fastp may also be used. This tool runs through the following bash script

FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/copied_old_infiles/init_fastq_files/*.fastq.gz"
for f in $FILES
    do
        b=`basename $f`
        echo Running trimming for the file $b

        c=${b::-9}
        o="$c.trim.fastq"
    log="$c.out.log"
        c="$c.log.txt"
        java -jar /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/trim/Trimmomatic-0.39/trimmomatic-0.39.jar SE -threads 20 -phred33 -trimlog $c $f $o ILLUMINACLIP:/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/copied_old_infiles/adapter_contamination_sequences_AR3.txt:2:30:5:1:true SLIDINGWINDOW:5:15 LEADING:20 TRAILING:20 MINLEN:15 2> $log
    done
Reads Surviving

Reads Surviving

Sample TotalReads SurvivingReads PercentSurviving DroppedReads PercentDropped
10_12d_4hrA 12625447 4421375 35.0 8204072 65.0
11_12d_4hrA 16238016 8893201 54.8 7344815 45.2
12_12d_4hrA 9024387 7219072 80.0 1805315 20.0
13_12d_24hrA 8467459 5182199 61.2 3285260 38.8
16_12d_24hrA 7908381 6458185 81.7 1450196 18.3
17_12d_24hrA 29872147 17868683 59.8 12003464 40.2
18_12d_24hrA 11793810 8182929 69.4 3610881 30.6
19_12d_2hrR 11606761 9574385 82.5 2032376 17.5
1_12d_t0 8698124 6498924 74.7 2199200 25.3
21_12d_2hrR 9784686 3585294 36.6 6199392 63.4
22_12d_2hrR 6741253 4728213 70.1 2013040 29.9
23_12d_2hrR 7836111 4455897 56.9 3380214 43.1
25_12d_24hrR 13534542 8957294 66.2 4577248 33.8
28_12d_24hrR 5866632 2402889 41.0 3463743 59.0
29_12d_24hrR 8305061 2923930 35.2 5381131 64.8
2_12d_t0 8344087 6946381 83.2 1397706 16.8
30_12d_24hrR 12811149 8785705 68.6 4025444 31.4
31_4d_t0 11936271 10306761 86.3 1629510 13.7
32_4d_t0 11595360 7974434 68.8 3620926 31.2
33_4d_t0 13655749 10422782 76.3 3232967 23.7
34_4d_t0 15201813 11186661 73.6 4015152 26.4
35_4d_4hrA 11045591 7821501 70.8 3224090 29.2
36_4d_4hrA 13840101 9383343 67.8 4456758 32.2
37_4d_4hrA 15579833 12373672 79.4 3206161 20.6
38_4d_4hrA 15286687 12142640 79.4 3144047 20.6
39_4d_24hrA 12442304 9755623 78.4 2686681 21.6
40_4d_24hrA 12866355 11130799 86.5 1735556 13.5
41_4d_24hrA 9720710 8032689 82.6 1688021 17.4
42_4d_24hrA 16698310 12979297 77.7 3719013 22.3
43_4d_2hrR 12989273 9158318 70.5 3830955 29.5
44_4d_2hrR 10968948 8421811 76.8 2547137 23.2
45_4d_2hrR 13812376 6527219 47.3 7285157 52.7
46_4d_2hrR 16946884 12011097 70.9 4935787 29.1
47_4d_24hrR 14627707 12933658 88.4 1694049 11.6
48_4d_24hrR 13383720 10269804 76.7 3113916 23.3
49_4d_24hrR 22803428 18172334 79.7 4631094 20.3
50_4d_24hrR 14759393 7495120 50.8 7264273 49.2
5_12d_t0 11217968 7594023 67.7 3623945 32.3
6_12d_t0 13607922 9592402 70.5 4015520 29.5
7_12d_4hrA 10489257 6515199 62.1 3974058 37.9

QC for trimmed reads

Quality Control (QC) is done for the trimmed reads with fastQC. We activate the conda environment condafastqc for this process.

if (!require("fastqcr")) {
        install.packages("fastqcr", repos="http://cran.us.r-project.org")
        library(fastqcr)
} else {library(fastqcr)}
fastqc(fq.dir="/path/to/directory/trimmed_fastq_files",
      qc.dir="/path/to/directory/fastqc/trim_fastqc",
      threads=20)

Running the script with the following command:

Rscript fastqcr.R

FastQC Output

Plots for each FASTQC metric after trimming is shown below. All the adapters were removed and overall quality looks better.

General Statistics from multiqc

Sample PercentDups PercentGC BPMedianReadLength MillionsSeqs
10_12d_4hrA 89.4 50 22 4.4
11_12d_4hrA 90.3 49 22 8.9
12_12d_4hrA 88.7 51 22 7.2
13_12d_24hrA 88.4 50 22 5.2
16_12d_24hrA 91.9 48 22 6.5
17_12d_24hrA 91.2 48 22 17.9
18_12d_24hrA 84.9 50 22 8.2
19_12d_2hrR 88.5 51 22 9.6
1_12d_t0 86.7 49 22 6.5
21_12d_2hrR 90.5 47 22 3.6
22_12d_2hrR 80.9 50 22 4.7
23_12d_2hrR 86.9 49 22 4.5
25_12d_24hrR 86.6 50 22 9.0
28_12d_24hrR 84.3 50 22 2.4
29_12d_24hrR 83.9 50 22 2.9
2_12d_t0 93.8 53 22 6.9
30_12d_24hrR 78.7 51 22 8.8
31_4d_t0 80.8 49 22 10.3
32_4d_t0 77.1 50 22 8.0
33_4d_t0 86.5 48 22 10.4
34_4d_t0 81.7 51 22 11.2
35_4d_4hrA 86.5 50 22 7.8
36_4d_4hrA 82.7 49 22 9.4
37_4d_4hrA 84.8 51 22 12.4
38_4d_4hrA 84.2 48 22 12.1
39_4d_24hrA 79.1 51 23 9.8
40_4d_24hrA 73.6 50 27 11.1
41_4d_24hrA 79.0 49 22 8.0
42_4d_24hrA 85.3 50 22 13.0
43_4d_2hrR 86.9 50 22 9.2
44_4d_2hrR 82.1 50 20 8.4
45_4d_2hrR 80.0 50 20 6.5
46_4d_2hrR 88.4 48 22 12.0
47_4d_24hrR 73.3 48 26 12.9
48_4d_24hrR 82.6 52 22 10.3
49_4d_24hrR 87.9 49 22 18.2
50_4d_24hrR 84.4 49 20 7.5
5_12d_t0 81.2 50 22 7.6
6_12d_t0 82.7 51 22 9.6
7_12d_4hrA 79.5 50 22 6.5

Sequence Counts for each sample

Sequence Counts

Sequence Counts

The mean quality scores per base in the read

Per Base Quality

Per Base Quality

The average quality scores per sequence

Per Sequence Quality

Per Sequence Quality

The GC content per sequence

GC conetent

GC conetent

Sequence Duplication

Duplication in each sequence

Duplication in each sequence

Overrepresented sequences

Total overrepresented sequences found in each sample

Total overrepresented sequences found in each sample

Overall Pass/Fail

Checking if each sample passes or fails in various meteris

Checking if each sample passes or fails in various meteris

Sequence length distribution

length distribution of sequences

length distribution of sequences

Alignment

First things first- create a conda environment that has bowtie, samtools, picard, and bedtools.

conda create -n condabowtie bowtie samtools picard bedtools

The trimmed reads are aligned to the reference genome using bowtie. To analyze reads mapping with mitochondrial genome, we perform the alignment in two steps.

  1. Complete Genome (with mitochondrial genome)- with algenome.fa
  2. Only the Mitochondrial genome which is extracted from genome.fa- almitogenome.fa

To begin alignment, we make bowtie indexes, this is done in two sets.

bowtie-index algenome.fa
bowtie-index almitogenome.fa

Alignment with genome

The alignment was done on the trimmed reads from previous section.

#!/bin/bash
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/trim/*.fastq"
for f in $FILES
    do
        b=`basename $f`
        echo Running trimming for the file $b
        t=${b::-6}
        o="$t.sam"
        uo="$t.mapped.fq"
        echo Output file is $o
        bowtie \
     -p 20 \
         -t \
     -k 5 \
     --best \
     --strata \
     -e 99999 \
     -v 0 \
     -l 15 \
     --chunkmbs 2048 \
     -x /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/bowtie_index_alim/algenome \
     -q $f \
     --al $uo \
     --sam --no-unal > $o 2> $t.bowtie.log
    done
#done

The following stats show alignment QC.

Bowtie Alignment Stats with Genome

Bowtie Alignment Stats with Genome

This table shows the number of reads aligned in each sample (MillionAlignedReads). It also shows the number of aligments of various reads that includes multimapped reads (MillionMappedReads).

Sample PercentAligned MillionAlignedReads MillionMappedReads
10_12d_4hrA 73.2 3.2 8.0
11_12d_4hrA 73.6 6.5 15.4
12_12d_4hrA 70.4 5.1 12.4
13_12d_24hrA 69.6 3.6 8.7
16_12d_24hrA 77.5 5.0 10.3
17_12d_24hrA 75.8 13.5 31.4
18_12d_24hrA 73.7 6.0 15.0
19_12d_2hrR 71.9 6.9 16.8
1_12d_t0 73.3 4.8 11.1
21_12d_2hrR 77.8 2.8 5.3
22_12d_2hrR 66.2 3.1 7.4
23_12d_2hrR 73.2 3.3 7.8
25_12d_24hrR 69.7 6.2 15.0
28_12d_24hrR 65.3 1.6 3.7
29_12d_24hrR 70.9 2.1 5.0
2_12d_t0 76.9 5.3 13.1
30_12d_24hrR 65.5 5.8 13.8
31_4d_t0 62.0 6.4 11.8
32_4d_t0 73.5 5.9 12.2
33_4d_t0 80.6 8.4 15.0
34_4d_t0 70.5 7.9 15.8
35_4d_4hrA 71.5 5.6 9.2
36_4d_4hrA 62.1 5.8 8.3
37_4d_4hrA 69.1 8.6 16.4
38_4d_4hrA 76.1 9.2 15.4
39_4d_24hrA 79.0 7.7 17.2
40_4d_24hrA 52.3 5.8 10.9
41_4d_24hrA 63.0 5.1 8.8
42_4d_24hrA 66.6 8.6 15.2
43_4d_2hrR 71.2 6.5 11.7
44_4d_2hrR 64.6 5.4 11.8
45_4d_2hrR 65.7 4.3 9.5
46_4d_2hrR 71.2 8.6 13.8
47_4d_24hrR 80.1 10.4 24.4
48_4d_24hrR 51.8 5.3 10.7
49_4d_24hrR 72.3 13.1 23.2
50_4d_24hrR 79.4 6.0 13.4
5_12d_t0 66.2 5.0 11.9
6_12d_t0 67.5 6.5 15.5
7_12d_4hrA 65.6 4.3 10.1

Alignment with mitochondria

The alignment was done on the trimmed reads from previous section.

The following stats show alignment QC.

#!/bin/bash
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/trim/*.fastq"   
for f in $FILES
    do
        b=`basename $f`
        echo Running alignment for the file $b
        t=${b::-6}
        o="$t.sam"
    uo="$t.mapped.fq"
        echo Output file is $o
        bowtie \
     -p 20 \
         -t \
     -k 10 \
     --best \
     --strata \
     -e 99999 \
     -v 0 \
     -l 15 \
     --chunkmbs 2048 \
     -x /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/bowtie_mito_index/almitogenome \
     -q $f \
     --al $uo \
     --sam --no-unal > $o 2> $t.bowtie.log
    done
#done
Bowtie Alignment Stats with Mitochondrial Genome

Bowtie Alignment Stats with Mitochondrial Genome

This table shows the number of reads aligned in each sample (MillionAlignedReads). It also shows the number of aligments of various reads that includes multimapped reads (MillionMappedReads).

Sample PercentAligned MillionAlignedReads MillionMappedReads
10_12d_4hrA 0.34 0.01 0.02
11_12d_4hrA 0.47 0.04 0.04
12_12d_4hrA 0.41 0.03 0.03
13_12d_24hrA 0.50 0.03 0.03
16_12d_24hrA 0.52 0.03 0.03
17_12d_24hrA 3.25 0.58 0.59
18_12d_24hrA 6.22 0.51 0.52
19_12d_2hrR 0.92 0.09 0.09
1_12d_t0 1.25 0.08 0.08
21_12d_2hrR 0.79 0.03 0.03
22_12d_2hrR 4.43 0.21 0.21
23_12d_2hrR 2.68 0.12 0.12
25_12d_24hrR 4.72 0.42 0.43
28_12d_24hrR 0.32 0.01 0.01
29_12d_24hrR 4.08 0.12 0.12
2_12d_t0 0.25 0.02 0.02
30_12d_24hrR 8.18 0.72 0.73
31_4d_t0 4.99 0.51 0.52
32_4d_t0 10.12 0.81 0.82
33_4d_t0 1.99 0.21 0.21
34_4d_t0 6.72 0.75 0.77
35_4d_4hrA 0.43 0.03 0.03
36_4d_4hrA 0.59 0.06 0.06
37_4d_4hrA 1.10 0.14 0.14
38_4d_4hrA 5.67 0.69 0.70
39_4d_24hrA 8.32 0.81 0.83
40_4d_24hrA 7.77 0.87 0.88
41_4d_24hrA 7.50 0.60 0.61
42_4d_24hrA 2.13 0.28 0.28
43_4d_2hrR 0.61 0.06 0.06
44_4d_2hrR 1.00 0.08 0.09
45_4d_2hrR 4.69 0.31 0.31
46_4d_2hrR 1.07 0.13 0.13
47_4d_24hrR 16.68 2.16 2.19
48_4d_24hrR 0.18 0.02 0.02
49_4d_24hrR 1.49 0.27 0.28
50_4d_24hrR 2.22 0.17 0.17
5_12d_t0 6.11 0.46 0.47
6_12d_t0 6.47 0.62 0.63
7_12d_4hrA 5.39 0.35 0.36

Another QC that we could fo with the aligned read is with Picard. The following is the command line to do that:

FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alim_for_anno/alim_mito_anno/*sorted.bam"
for f in $FILES
    do
        b=`basename $f`
        echo Converting file $b
        o="$b.picard.txt"
        echo Output file is $o
    picard -Xmx5g CollectAlignmentSummaryMetrics I=$f O=$o R=/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/copied_old_infiles/GCF_001266775.1_Austrofundulus_limnaeus-1.0_genomic_andMITO.fna
    done
#done

Before we proceed, we need to sort the sam, convert it into bam and create a bam index (bai). This step ensures that we could store the alignments in acceptable binary format that takes less storage space and is compatible with subsequent steps.

FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alim_for_anno/alim_mito_anno/*.sam"
for f in $FILES
    do
        b=`basename $f`
        echo Converting $b to bam
        o=${b::-4}
        bam="$o.bam"
    sortedbam="$b.sorted.bam"
    
    echo Tmp file is $bam
        samtools view -b $f > $bam
    
        echo Output files are $sortedbam and its index
    samtools sort $bam -o $sortedbam
    samtools index $sortedbam
    echo Deleting Tmp file $bam 
    rm $bam 
    done
#done

Using aligned reads

We will create three files:

  1. Summary information that has read ID, sequence, reference location, and number of alignments.
  2. Exact location, genomic features of the reference location that the reads mapped to via bedtools.
  3. Counting the readcounts from bam files.
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alim_for_anno/alim_mito_anno/*.bam"
for f in $FILES
    do
        b=`basename $f`
        echo $b
        o=${b::-4}
        o="$o.summary.txt"
    input_file="$f"
    output_file="$o.summary.txt"

    # Process the input file with samtools and awk
    samtools view $f | awk '{print $1, $10, $3, $4, $15}' | sort | uniq > "$output_file"
    echo "Output file written $output_file"
    done

The result is the following data for each file. The sample 1_12d_t0 is shown.

FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alim_for_anno/alim_mito_anno/*.bam"
for f in $FILES
    do
        b=`basename $f`
        echo $b
        o=${b::-4}
        bed="$o.bed"
    out="$o.bedtools.txt"
    gff="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/mito_genome/GCF_001266775.1_Austrofundulus_limnaeus_MITOonly.gff"
    bedtools intersect -a $f -b $gff -wo -f 1 -bed > $bed
    awk '!seen[$4]++' $bed > tmp.csv
    rm $bed
    awk '{print $4, $21}' tmp.csv > $out
    rm tmp.csv
    echo "Output file written $out"
    done

The result of above command gives us the following data for each file. The sample 1_12d_t0 is shown.

FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alim_for_anno/alim_mito_anno/*.bam"
for f in $FILES
    do
        b=`basename $f`
        echo $b
        o=${b::-4}
        o="$o.readcount.txt"
        echo Output file is $o
    # Count unique occurrences of column 10
    samtools view $f | awk '{print $10}'| sort | uniq -c > "$o"
    echo "Output written to $o"
    done

The result of above command gives us the following data for each file. The sample 1_12d_t0 is shown.

miRTrace

Running mirtrace with the following command.

mirtrace qc --species dre --config seq_address.txt

Looking at the results:

sports

Executing sports tool on mapped reads only. We use the following command to run:

#!/bin/bash
sports.pl -i seqs_alim_mito.txt \
-o /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/out_sports/REDO/mito_mapped_reads \
-k \
-M 2 \
-p 20 \
-g /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/bowtie_mito_index/almitogenome \
-m /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/miRBase_21/miRBase_21-dre \
-r /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/rRNA_db/zebrafish_rRNA \
-t /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/GtRNAdb/danRer6-tRNAs \
-w /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/piRBase/piR_dre_v1.0 \
-e /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/Ensembl/Danio_rerio.GRCz10.ncrna \
-f /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/Rfam_12.3/Rfam-12.3-zebrafish

This will generate a number of output files, from which the tables generated as a result file *_output.txt for each sample are taken and are used to extract annotation of mapped reads. Top 50 rows of this table are shown for the sample 10_12d_4hrA below.

It also generates a number of plots to categorize the annotations and their length distributions for each sample.

Compiling output files

At the end, we need to compile all the output files into a single dataframe that can be exported to further do downstream processing such as statistics, differential expression, and clustering.

We provide an R script that compiles these output files:

  1. Annotation: From sports- *.mapped_output.txt
  2. Alignment summary: From .bam files with samtools- *.summary.txt
  3. Readcount: From .bam files with samtools and awk- *.readcount.txt
  4. Location: From .bam files with bedtools- *.bedtools.txt

In addition, we also need a sample design file (.csv format) that has details about the samples. This file makes it easier to do statistical analysis on the data using R packages like DESeq2, ExpressionSet, and EdgeR.

You would alter the following lines in the R script that basically tells the program where these txt files are located.

parent_dir <- "D:/PSU_work/smrnaseq/mito_map"
samples <- "D:/PSU_work/smrnaseq/mito_map/design_samples.csv"
readcount_txt_pattern <- "sorted.readcount.txt$"
output_txt_pattern <- "mapped_output.txt$"
summary_txt_pattern <- "summary.txt$"
bedtools_txt_pattern <- "bedtools.txt$"

Set pattern of filenames: these files are searched in the parent_dir recursively. Make sure that the parent_dir contains these files and the pattern should be unique to list the files that we need.

Visualizing

We use the alignments with mitochondrial genome to view the aligned reads.

Jbrowse is used to explore alignments.

Files used:

  1. Aligned bam files for each sample and its corresponding index file (bam, bai)
  2. Reference genome for mitochondria (.fasta) and its index file (.fai)
  3. Genome feature file (.gff) for mitochondrial genome

Click to enlarge

This figure shows mitochondiral genome of 100bp length from 17100 to 17199. On this genome, we see 1) the colored basepairs, 2) Genomic features of the mentioned length, 3) Alignments from 4 samples (4d t0 replicates) that show reads aligned to the genome colored by strand while also showing the coverage histogram in grey.

---
title: "mitosRNAseq Workflow"
date: 08-15-2024
output: 
    html_notebook:
        toc: TRUE
        toc_float: true
---


##  Summary
This workflow demonstrates the mitosRNAseq pipeline which identifies several types of
small RNA from sequencing data from short reads of <50 nt. The following method collects sequence-based 
counts, sequence locations, genomic features, and annotations. 
The data we use is from killifish embyros anoxia experiments. 

The step-by-step workflow we follow is illistrated below and a 
detailed method to perform small RNA seq on an academic HPC is decsribed in this document. 

![](/home/gazal/Documents/PSU_work/sRNA_gazal/documentation/Figure 1.png)

The following table shows the contents of the file **samples.csv** that contains metadata for our samples.
We start with a total of 40 fastq files.

### Sample Table
```{r echo=FALSE, results='asis', warning=FALSE}
library(knitr)
library(readr)
library(formattable)
library(readr)
df_samples <- read_csv("samples.csv", show_col_types = FALSE)
formattable(df_samples,
        table.attr = 'style = "width:50%; font-size:90%;"'
        )
```

## QC for raw reads

Quality Control (QC) is done by running fastQC on raw reads to evaluate how good (or bad) our data looks.

### Install software for Quality Check

**conda**

Set up a virtual environment in conda that has all of the Quality Control (QC) tools. 
We are using [Miniconda](https://docs.anaconda.com/miniconda/). 
For creating sharable environments, create .yaml files for each environment you're making.

```
conda create -n condafastqc fastqc
conda activate condafastqc
conda install -n multiqc
```

**tmux**

For almost every long-ish process, 
we use tmux to run tasks in a detached terminal window that can keep running if we want to log out or if we accidently log out.
```
tmux new -s fastqc
```

### Executing

Creating a new file called **fastqcr.R** that has the script to run fastqc through R.
We use this script to automate the process for a list of files in a directory. 
Running commands through various scripts also keeps a record of our commands.

```
if (!require("fastqcr")) {
        install.packages("fastqcr", repos="http://cran.us.r-project.org")
        library(fastqcr)
} else {library(fastqcr)}
fastqc(fq.dir="/path/to/directory/init_fastq_files",
      qc.dir="/path/to/directory/fastqc/init_fastqc",
      threads=20)
```

Running the script with the following command:
```
Rscript fastqcr.R
```
Press `Ctrl+B` and then `D` for detaching this tmux window. 
You can log out from your remote computer (HPC) and close your terminal 
window now while this process keeps running on your remote computer. 
To get back to this window, log in to your remote computer, and
 use the command `tmux a -t fastqc`. F
or more information, check [tmux](https://github.com/tmux/tmux/wiki/Getting-Started). 

We will get a fastqc HTML file in the output fastqc directory for 
each fastq sequence in the input init_fastq_files directory.

### FastQC Output

**General Statistics from multiqc**
```{r echo=FALSE, results='asis', warning=FALSE}
library(readr)
library(formattable)
df_stats <- read_tsv("multiqc_init_fastq/general_stats_table.tsv", show_col_types = FALSE)
is.num <- sapply(df_stats, is.numeric)
df_stats[is.num] <- lapply(df_stats[is.num], round, 1)
formattable(df_stats, list(
    'PercentDups' = color_bar("#E9967A"),
    'PercentGC' = color_bar("#DAA520"),
    'BPMedianReadLength' = color_bar("#66CDAA"),
    'MillionsSeqs' = color_bar("#DA70D6")),
    table.attr = "class=\'table table-condensed\', style=\'width:80%; font-size:90%;\'"
)
```

**Sequence Counts for each sample**

![Sequence Counts](multiqc_init_fastq/multiqc_plots/fastqc_sequence_counts_plot.png){width=550px}

**The mean quality scores per base in the read**

![Per Base Quality](multiqc_init_fastq/multiqc_plots/fastqc_per_base_sequence_quality_plot.png){width=550px}

**The average quality scores per sequence**

![Per Sequence Quality](multiqc_init_fastq/multiqc_plots/fastqc_per_sequence_quality_scores_plot.png){width=550px}

**The GC content per sequence**

![GC conetent](multiqc_init_fastq/multiqc_plots/fastqc_per_sequence_gc_content_plot.png){width=550px}


**Sequence Duplication**

![Duplication in each sequence](multiqc_init_fastq/multiqc_plots/fastqc_sequence_duplication_levels_plot.png){width=550px}

**Overrepresented sequences**

![Total overrepresented sequences found in each sample](multiqc_init_fastq/multiqc_plots/fastqc_overrepresented_sequences_plot.png){width=550px}

**Adapter Content**

![Adapters found at sequence positions](multiqc_init_fastq/multiqc_plots/fastqc_adapter_content_plot.png){width=550px}

**Overall Pass/Fail**

![Checking if each sample passes or fails in various meteris](multiqc_init_fastq/multiqc_plots/fastqc-status-check-heatmap.png){width=550px}

**Sequence length distribution**

All samples had sequences of a single length (50bp , 36bp , 101bp).

## Trimming

Trimming is done with Trimmomatic in our case, however, other tools like cutadapt or fastp may also be used.
This tool runs through the following bash script

```
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/copied_old_infiles/init_fastq_files/*.fastq.gz"
for f in $FILES
    do
        b=`basename $f`
        echo Running trimming for the file $b

        c=${b::-9}
        o="$c.trim.fastq"
	log="$c.out.log"
        c="$c.log.txt"
        java -jar /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/trim/Trimmomatic-0.39/trimmomatic-0.39.jar SE -threads 20 -phred33 -trimlog $c $f $o ILLUMINACLIP:/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/copied_old_infiles/adapter_contamination_sequences_AR3.txt:2:30:5:1:true SLIDINGWINDOW:5:15 LEADING:20 TRAILING:20 MINLEN:15 2> $log
    done
```

![Reads Surviving](trimmomatic/trimmomatic_plot.png){width=550px}

```{r echo=FALSE, results='asis', warning=FALSE}
library(readr)
library(formattable)
df_stats <- read_tsv("trimmomatic/trimmomatic_plot.tsv", show_col_types = FALSE)
is.num <- sapply(df_stats, is.numeric)
df_stats[is.num] <- lapply(df_stats[is.num], round, 1)
formattable(df_stats, list(
    'TotalReads' = color_bar("#BDB76B"),
    'SurvivingReads' = color_bar("#90EE90"),
    'PercentSurviving' = color_bar("#32CD32"),
    'DroppedReads' = color_bar("#F08080"),
    'PercentDropped' = color_bar("#CD5C5C")),
    table.attr = "class=\'table table-condensed\', style=\'width:100%; font-size:90%;\'"
)
```
				

## QC for trimmed reads

Quality Control (QC) is done for the trimmed reads with fastQC. 
We activate the conda environment `condafastqc` for this process.

```
if (!require("fastqcr")) {
        install.packages("fastqcr", repos="http://cran.us.r-project.org")
        library(fastqcr)
} else {library(fastqcr)}
fastqc(fq.dir="/path/to/directory/trimmed_fastq_files",
      qc.dir="/path/to/directory/fastqc/trim_fastqc",
      threads=20)
```

Running the script with the following command:
```
Rscript fastqcr.R
```

### FastQC Output

Plots for each FASTQC metric after trimming is shown below. 
All the adapters were removed and overall quality looks better.

**General Statistics from multiqc**
```{r echo=FALSE, results='asis', warning=FALSE}
library(readr)
library(formattable)
df_stats <- read_tsv("multiqc_trim_fastq/general_stats_table.tsv", show_col_types = FALSE)
is.num <- sapply(df_stats, is.numeric)
df_stats[is.num] <- lapply(df_stats[is.num], round, 1)
formattable(df_stats, list(
    'PercentDups' = color_bar("#E9967A"),
    'PercentGC' = color_bar("#DAA520"),
    'BPMedianReadLength' = color_bar("#66CDAA"),
    'MillionsSeqs' = color_bar("#DA70D6")),
    table.attr = "class=\'table table-condensed\', style=\'width:80%; font-size:90%;\'"
)
```
**Sequence Counts for each sample**

![Sequence Counts](multiqc_trim_fastq/multiqc_plots/fastqc_sequence_counts_plot.png){width=550px}

**The mean quality scores per base in the read**

![Per Base Quality](multiqc_trim_fastq/multiqc_plots/fastqc_per_base_sequence_quality_plot.png){width=550px}

**The average quality scores per sequence**

![Per Sequence Quality](multiqc_trim_fastq/multiqc_plots/fastqc_per_sequence_quality_scores_plot.png){width=550px}

**The GC content per sequence**

![GC conetent](multiqc_trim_fastq/multiqc_plots/fastqc_per_sequence_gc_content_plot.png){width=550px}


**Sequence Duplication**

![Duplication in each sequence](multiqc_trim_fastq/multiqc_plots/fastqc_sequence_duplication_levels_plot.png){width=550px}

**Overrepresented sequences**

![Total overrepresented sequences found in each sample](multiqc_trim_fastq/multiqc_plots/fastqc_overrepresented_sequences_plot.png){width=550px}

**Overall Pass/Fail**

![Checking if each sample passes or fails in various meteris](multiqc_trim_fastq/multiqc_plots/fastqc-status-check-heatmap.png){width=550px}

**Sequence length distribution**

![length distribution of sequences](multiqc_trim_fastq/multiqc_plots/fastqc_sequence_length_distribution_plot.png){width=550px}

## Alignment

First things first- create a conda environment that has bowtie, samtools, picard, and bedtools.

`conda create -n condabowtie bowtie samtools picard bedtools`

The trimmed reads are aligned to the reference genome using bowtie.
To analyze reads mapping with mitochondrial genome, we perform the alignment in two steps.

1. Complete Genome (with mitochondrial genome)- with algenome.fa
2. Only the Mitochondrial genome which is extracted from genome.fa- almitogenome.fa

To begin alignment, we make bowtie indexes, this is done in two sets.

```
bowtie-index algenome.fa
bowtie-index almitogenome.fa
```

### Alignment with genome

The alignment was done on the trimmed reads from previous section.

```
#!/bin/bash
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/trim/*.fastq"
for f in $FILES
    do
        b=`basename $f`
        echo Running trimming for the file $b
        t=${b::-6}
        o="$t.sam"
        uo="$t.mapped.fq"
        echo Output file is $o
        bowtie \
	 -p 20 \
     	 -t \
	 -k 5 \
	 --best \
	 --strata \
	 -e 99999 \
	 -v 0 \
   	 -l 15 \
  	 --chunkmbs 2048 \
	 -x /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/bowtie_index_alim/algenome \
	 -q $f \
	 --al $uo \
	 --sam --no-unal > $o 2> $t.bowtie.log
    done
#done
```

The following stats show alignment QC. 

![Bowtie Alignment Stats with Genome](alignment/genome/bowtie1_alignment.png){width=550px}

This table shows the number of reads aligned in each sample (MillionAlignedReads). 
It also shows the number of aligments of various reads that includes multimapped reads (MillionMappedReads).

```{r echo=FALSE, results='asis', warning=FALSE}
library(readr)
library(formattable)
df_stats <- read_tsv("alignment/genome/general_stats_table.tsv", show_col_types = FALSE)
is.num <- sapply(df_stats, is.numeric)
df_stats[is.num] <- lapply(df_stats[is.num], round, 1)
formattable(df_stats, list(	
    'PercentAligned' = color_bar("#B0C4DE"),
    'MillionAlignedReads' = color_bar("#E6E6FA"),
    'MillionMappedReads' = color_bar("#778899"),
    table.attr = "class=\'table table-condensed\', style=\'width:80%; font-size:90%;\'")
)
```
### Alignment with mitochondria

The alignment was done on the trimmed reads from previous section.

The following stats show alignment QC. 

```
#!/bin/bash
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/trim/*.fastq"	
for f in $FILES
    do
        b=`basename $f`
        echo Running alignment for the file $b
        t=${b::-6}
        o="$t.sam"
	uo="$t.mapped.fq"
        echo Output file is $o
        bowtie \
	 -p 20 \
     	 -t \
	 -k 10 \
	 --best \
	 --strata \
	 -e 99999 \
	 -v 0 \
   	 -l 15 \
  	 --chunkmbs 2048 \
	 -x /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/bowtie_mito_index/almitogenome \
	 -q $f \
	 --al $uo \
	 --sam --no-unal > $o 2> $t.bowtie.log
    done
#done
```

![Bowtie Alignment Stats with Mitochondrial Genome](alignment/mitochondria/bowtie1_alignment.png){width=550px}

This table shows the number of reads aligned in each sample (MillionAlignedReads). 
It also shows the number of aligments of various reads that includes multimapped reads (MillionMappedReads).

```{r echo=FALSE, results='asis', warning=FALSE}
library(readr)
library(formattable)
df_stats <- read_tsv("alignment/mitochondria/general_stats_table.tsv", show_col_types = FALSE)
is.num <- sapply(df_stats, is.numeric)
df_stats[is.num] <- lapply(df_stats[is.num], round, 2)
formattable(df_stats, list(	
    'PercentAligned' = color_bar("#B0C4DE"),
    'MillionAlignedReads' = color_bar("#E6E6FA"),
    'MillionMappedReads' = color_bar("#778899"),
    table.attr = "class=\'table table-condensed\', style=\'width:80%; font-size:90%;\'")
)
```
Another QC that we could fo with the aligned read is with Picard. The following is the command line to do that:

```
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alim_for_anno/alim_mito_anno/*sorted.bam"
for f in $FILES
    do
        b=`basename $f`
        echo Converting file $b
        o="$b.picard.txt"
        echo Output file is $o
	picard -Xmx5g CollectAlignmentSummaryMetrics I=$f O=$o R=/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/copied_old_infiles/GCF_001266775.1_Austrofundulus_limnaeus-1.0_genomic_andMITO.fna
    done
#done

```

Before we proceed, we need to sort the sam, convert it into bam and create a bam index (bai).
This step ensures that we could store the alignments in acceptable binary format that takes less storage space
and is compatible with subsequent steps.


```
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alim_for_anno/alim_mito_anno/*.sam"
for f in $FILES
    do
        b=`basename $f`
        echo Converting $b to bam
        o=${b::-4}
        bam="$o.bam"
	sortedbam="$b.sorted.bam"
	
	echo Tmp file is $bam
        samtools view -b $f > $bam
	
        echo Output files are $sortedbam and its index
	samtools sort $bam -o $sortedbam
	samtools index $sortedbam
	echo Deleting Tmp file $bam 
	rm $bam 
    done
#done
```

## Using aligned reads

We will create three files:

1. Summary information that has read ID, sequence, reference location, and number of alignments.
2. Exact location, genomic features of the reference location that the reads mapped to via bedtools.
3. Counting the readcounts from bam files.

```#!/bin/bash
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alim_for_anno/alim_mito_anno/*.bam"
for f in $FILES
    do
        b=`basename $f`
        echo $b
        o=${b::-4}
        o="$o.summary.txt"
	input_file="$f"
	output_file="$o.summary.txt"

	# Process the input file with samtools and awk
	samtools view $f | awk '{print $1, $10, $3, $4, $15}' | sort | uniq > "$output_file"
	echo "Output file written $output_file"
    done
```
The result is the following data for each file. The sample 1_12d_t0 is shown.
![](screens/summary_1_12d_t0.png)

```#!/bin/bash
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alim_for_anno/alim_mito_anno/*.bam"
for f in $FILES
    do
        b=`basename $f`
        echo $b
        o=${b::-4}
        bed="$o.bed"
	out="$o.bedtools.txt"
	gff="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/mito_genome/GCF_001266775.1_Austrofundulus_limnaeus_MITOonly.gff"
	bedtools intersect -a $f -b $gff -wo -f 1 -bed > $bed
	awk '!seen[$4]++' $bed > tmp.csv
	rm $bed
	awk '{print $4, $21}' tmp.csv > $out
	rm tmp.csv
	echo "Output file written $out"
    done

```
The result of above command gives us the following data for each file. The sample 1_12d_t0 is shown.
![](screens/bedtools_1_12d_t0.png)


```#!/bin/bash
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alim_for_anno/alim_mito_anno/*.bam"
for f in $FILES
    do
        b=`basename $f`
        echo $b
        o=${b::-4}
        o="$o.readcount.txt"
        echo Output file is $o
	# Count unique occurrences of column 10
	samtools view $f | awk '{print $10}'| sort | uniq -c > "$o"
	echo "Output written to $o"
    done

```
The result of above command gives us the following data for each file. The sample 1_12d_t0 is shown.
![](screens/readcount_1_12d_t0.png)

### miRTrace
 Running mirtrace with the following command.

```#!/bin/bash
mirtrace qc --species dre --config seq_address.txt
```

Looking at the results:

![](mirtrace/mirtrace-length-plot.png){width=100%}
![](mirtrace/mirtrace-rnatype-plot.png){width=100%}

### sports

Executing sports tool on mapped reads only. We use the following command to run:

```
#!/bin/bash
sports.pl -i seqs_alim_mito.txt \
-o /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/out_sports/REDO/mito_mapped_reads \
-k \
-M 2 \
-p 20 \
-g /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/bowtie_mito_index/almitogenome \
-m /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/miRBase_21/miRBase_21-dre \
-r /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/rRNA_db/zebrafish_rRNA \
-t /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/GtRNAdb/danRer6-tRNAs \
-w /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/piRBase/piR_dre_v1.0 \
-e /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/Ensembl/Danio_rerio.GRCz10.ncrna \
-f /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/Rfam_12.3/Rfam-12.3-zebrafish
```

This will generate a number of output files, from which the tables
generated as a result file *_output.txt for each sample are taken and 
are used to extract annotation of mapped reads.
Top 50 rows of this table are shown for the sample 10_12d_4hrA below. 


![](sports/Screenshotfrom10_12d_4hrA.png){width=550px}

It also generates a number of plots to categorize the annotations 
and their length distributions for each sample.

![](sports/Screenshot10_12d_4hrA.png){width=550px}
![](sports/Screenshot10_12d_4hrA_graph.png){width=550px}

## Compiling output files

At the end, we need to compile all the output files into a single dataframe that can be exported
to further do downstream processing such as statistics, differential expression, and clustering.

We provide an [R script](scripts/merge_counts_mito.R) that compiles these output files:

1. Annotation: From **sports**- *.mapped_output.txt
2. Alignment summary: From .bam files with **samtools**- *.summary.txt
3. Readcount: From .bam files with **samtools** and **awk**- *.readcount.txt
4. Location: From .bam files with **bedtools**- *.bedtools.txt

In addition, we also need a sample design file (.csv format) that has details about the samples.
This file makes it easier to do statistical analysis on the data using R packages like 
DESeq2, ExpressionSet, and EdgeR.

You would alter the following lines in the R script that basically tells the program where these txt files are located.

```{r, eval=FALSE}
parent_dir <- "D:/PSU_work/smrnaseq/mito_map"
samples <- "D:/PSU_work/smrnaseq/mito_map/design_samples.csv"
readcount_txt_pattern <- "sorted.readcount.txt$"
output_txt_pattern <- "mapped_output.txt$"
summary_txt_pattern <- "summary.txt$"
bedtools_txt_pattern <- "bedtools.txt$"
```

Set pattern of filenames: these files are searched in the parent_dir recursively.
Make sure that the parent_dir contains these files and the pattern should be unique to list the files that we need.

## Visualizing
We use the alignments with mitochondrial genome to view the aligned reads.

Jbrowse is used to explore alignments.

Files used:

1. Aligned bam files for each sample and its corresponding index file (bam, bai)
3. Reference genome for mitochondria (.fasta) and its index file (.fai)
4. Genome feature file (.gff) for mitochondrial genome

![](jbrowse/jbrowse_4d_t0.svg)
<a href="jbrowse/jbrowse_4d_t0.svg" target="_blank"> Click to enlarge</a>

This figure shows mitochondiral genome of 100bp length from 17100 to 17199. On this genome, we see 
1) the colored basepairs, 2) Genomic features of the mentioned length, 3) Alignments from 4 samples (4d t0 replicates)
that show reads aligned to the genome colored by strand while also showing the coverage histogram in grey. 